Goto

Collaborating Authors

 training specification


Appendix and Training Specification

Neural Information Processing Systems

In all environments, we use a Transformer architecture with four layers and four self-attention heads. The total input vocabulary of the model is V (N + M +2) to account for states, actions, rewards, and rewards-to-go, but the output linear layer produces logits only over a vocabulary of size V; output tokens can be interpreted unambiguously because their offset is uniquely determined by that of the previous input. The dimension of each token embedding is 128. Dropout is applied at the end of each block with probability 0.1. We follow the learning rate scheduling of (Radford et al., 2018), increasing linearly from 0 to 2.5 10 4 over the course of 2000 updates.